APPRENTISSAGE SÉQUENTIEL : Bandits, Statistique et Renforcement (Sequential Learning: Bandits, Statistics and Reinforcement)

Author

  • Odalric-Ambrym Maillard
Abstract

This thesis studies the following topics in Machine Learning: Bandit theory, Statistical learning and Reinforcement learning. The common underlying thread is the non-asymptotic study of various notions of adaptation: to an environment or an opponent in Part I about bandit theory, to the structure of a signal in Part II about statistical theory, and to the structure of states and rewards or to some state-model of the world in Part III about reinforcement learning. First we derive a non-asymptotic analysis of a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit that enables us to match, in the case of distributions with finite support, the asymptotic distribution-dependent lower bound known for this problem. Then, for a multi-armed bandit facing a possibly adaptive opponent, we introduce history-based models that capture some weakness of the opponent, and show how one can benefit from such models to design algorithms that adapt to this weakness. Next we contribute to the regression setting and show how the use of random matrices can be beneficial both theoretically and numerically when the considered hypothesis space has a large, possibly infinite, dimension. We also use random matrices in the sparse-recovery setting to build sensing operators that allow recovery even when the basis is far from orthogonal. Finally we combine Parts I and II, first to provide a non-asymptotic analysis of reinforcement-learning algorithms such as Bellman-residual minimization and a version of least-squares temporal difference that uses random projections, and then, upstream of the Markov Decision Problem setting, to discuss the practical problem of choosing a good model of states.

Foreword: To the layman reader.

One difficult exercise in research is to explain what we are actually doing to, say, “the guy in the street”, i.e. someone who is not an expert of the field and maybe not even a scientist. In this introductory chapter, we try to explain and motivate what this thesis is about.

Mathematics, Computer Science, and “Informatics”. This thesis lies somewhere at the frontier between two very exciting domains. The first one is Mathematics, the second one is Informatics. Beyond the very naive separation between these two domains, which says that Mathematics is interested in theorems and proofs while Informatics is interested in computers, algorithms and complexity (that is, roughly speaking, the time and memory performance of algorithms), it is generally not so obvious to tell what is what, especially since these two first definitions are quite narrow. Here, I intentionally use the word “Informatics” rather than the more common term “Computer Science”. The reason is that “Computer Science” is a misleading term, as the following quote attributed to Edsger Dijkstra suggests: “Computer science is no more about computers than astronomy is about telescopes.” The French translation of Computer Science is “Informatique”, which conveys a different meaning: that this is a science interested in information, or better said in the information conveyed by objects, and not only in computers or algorithms. Moreover, the word Informatics already exists, although it is generally used in combination with other words, as in Bio-informatics. More precisely, what I call “Informatics” here studies 1) how information is created or processed, 2) how information is transferred or altered between objects, and 3) how to manage the objects of interest and retrieve information from them.
For instance, from a conventional Computer Science perspective, this is well handled by the abstract notion of a computer program that manages memory cells (bits) thanks to computer instructions written in some programming language, and that produces a so-called trace such as a text, an image or the solution to an equation. Thus the study of programming languages and of semantics, a specific field of theoretical computer science, is clearly important in order to understand Informatics. But now the term Computer Science is not only misleading but also restrictive, as the previous example can be seen as the result of applying Informatics to some specific objects, here memory cells (bits), while Informatics applies to more generic objects of interest and is thus much broader than what Computer Science suggests. Let us consider some random examples (in all this section, words in italics are technical words; they are not assumed to be known, and their precise meaning should not prevent the reader from understanding the global message):

  • For instance, let us consider that we apply Informatics to objects that are theorems. Then how we create information corresponds to the analysis of axioms, the basic statements assumed to be true and used as a starting point for reasoning. How information is transferred corresponds to the ways we combine theorems and make proofs: that is basic logic, or inference. Now how we retrieve information from theorems is linked to deeper notions of logic that involve technical things like λ-calculus and decidability, with some famous difficulties pointed out by Gödel in the 1930s.

  • For a more applied example, let us consider the result of applying Informatics to objects like proteins. This opens a very exciting field of research, directly linked with Biology. Indeed, studying how proteins are created is one main question underlying genomics (before translation of DNA) and part of proteomics (after). The way they interact with each other is studied by proteomics. Finally, fields like Virology or Pharmacy study how one can manipulate them in order to build specific biological functions. More generally, applying Informatics to other “biological units” like neurons, cells, ecosystems, etc. results in the development of a new, very active field of research called “Bioinformatics”.

  • For a last example, let us assume that the objects we consider are the rights of people, which is one important aspect of Law. Then one can use Informatics in order to study the creation of laws, the interaction between rights by means of contracts, and the effects of modifications of laws on the behavior of people. The study of such a complex dynamical system, consisting of many interacting objects of different types (people, contracts, ownership, etc.), is definitely not easy.

What Informatics brings. Of course the benefit of Informatics here is the power of formalization, together with the development of powerful tools coming for instance from Graph theory or Domain theory, and the possibility to derive proofs, which is why the frontier with Mathematics is fuzzy.
Actually it is not even important to tell what is what, whether Informatics is a sub-field of Mathematics or Mathematics is a sub-field of Informatics; the important thing is that Informatics enables us to analyze, understand and prove properties that concern a large diversity of topics, especially the ones that are not yet formalized, and is thus a very helpful tool for the growth of precise knowledge.

The Informatics of “learning”. Now in this thesis, we are interested in the vague notion of learning. In order to apply Informatics to such a notion, we need some underlying object of interest. One way is to consider “data”, or maybe sensors. Actually the underlying object of interest does not matter much here, since the notion of learning is itself a bit fuzzy. What is interesting is that with such objects we roughly recover various aspects of the very broad field of research that is naturally called “Machine Learning”, and that is directly relevant to this thesis (the following words in italics refer to some key words in Machine Learning): for instance, understanding how data is created or acquired is immediately identified as sampling…
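As a concrete illustration of the first contribution mentioned in the abstract (a Kullback-Leibler-based index policy for the stochastic multi-armed bandit), here is a minimal sketch of a KL-UCB-style index for Bernoulli-distributed rewards. This is only an illustrative sketch, not the thesis's exact algorithm (which handles general distributions with finite support through an empirical-KL index); the function names, the bisection tolerance and the exploration level log(t) below are assumptions made for the sake of the example.

    import math

    def kl_bernoulli(p, q, eps=1e-12):
        """Kullback-Leibler divergence KL(Ber(p) || Ber(q)), with clamping for stability."""
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    def kl_ucb_index(mean, pulls, t, tol=1e-6):
        """Largest q >= mean such that pulls * KL(mean, q) <= log(t).

        Found by bisection, since q -> KL(mean, q) is increasing on [mean, 1].
        """
        level = math.log(max(t, 2))
        lo, hi = mean, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if pulls * kl_bernoulli(mean, mid) > level:
                hi = mid
            else:
                lo = mid
        return lo

    def kl_ucb(pull_arm, n_arms, horizon):
        """Run a KL-UCB-style policy; pull_arm(a) must return a reward in [0, 1]."""
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        # Initialization: pull each arm once.
        for a in range(n_arms):
            sums[a] += pull_arm(a)
            counts[a] += 1
        for t in range(n_arms + 1, horizon + 1):
            # Pull the arm with the largest KL-based upper confidence index.
            indices = [kl_ucb_index(sums[a] / counts[a], counts[a], t) for a in range(n_arms)]
            a = max(range(n_arms), key=lambda i: indices[i])
            sums[a] += pull_arm(a)
            counts[a] += 1
        return counts

At each time step the policy pulls the arm whose empirical mean remains statistically compatible, at level log(t), with the largest possible reward; the interest of such KL-based indices is precisely that their regret can be related to the distribution-dependent lower bound mentioned in the abstract.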


Similar resources

Adaptation de la matrice de covariance pour l'apprentissage par renforcement direct

Abstract: Solving problems with continuous states and actions by optimizing parametric policies is a topic of recent interest in reinforcement learning. The PI algorithm is an example of this approach, which benefits from solid mathematical foundations drawn from stochastic optimal control and from the tools of statistical estimation theory. In this article, we…


Apprentissage par renforcement pour la conception de systèmes multi-agents réactifs

A new reinforcement learning (RL) methodology for the design of reactive multi-agent systems is presented. Although dealing with realistic situated agents with local perception does not fall within the framework where convergence of RL algorithms is guaranteed, in our method each agent individually learns its local behavior. The progressive aspect of learning, which pits the agents against more and…


Apprentissage par démonstrations : vaut-il la peine d’estimer une fonction de récompense?

Abstract: This article proposes a comparative study between Inverse Reinforcement Learning (IRL) and Imitation Learning (IL). IRL and IL are two frameworks that use the concept of Markov Decision Process (MDP) and in which we seek to solve the problem of Learning from Demonstrations. Learning from Demonstrations is a problem in which an agent called the apprentice…


Validation statistique des cartes de Kohonen en apprentissage supervisé

Abstract: In supervised learning, predicting the class is the ultimate goal. More broadly, a good learning methodology is expected to provide a representation of the data that facilitates the user's navigation through the example base and helps in choosing the relevant examples and variables, while ensuring a quality prediction whose…


Filtrage bayésien de la récompense

Abstract: A wide variety of value-function approximation schemes have been applied to reinforcement learning. However, Bayesian filtering approaches, which have nevertheless proved effective in other domains such as parameter learning for neural networks, have received little attention so far. This contribution introduces a general framework…


Apprentissage par renforcement : analyse des critères moyens et pondérés en horizon fini

Abstract: Decision problems posed by finite-horizon stochastic optimization in the absence of a model can be handled by adaptive methods. Various reinforcement-learning algorithms have been proposed, such as Q-Learning or R-Learning, but they are defined for infinite-horizon problems. We propose here a finite-horizon formulation with a…



Year of publication: 2011